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Abstract In this chapter, we present the optimization formulation of the Kalman 
filtering and smoothing problems, and use this perspective to develop a variety of 
extensions and applications. We first formulate classic Kalman smoothing as a least 
squares problem, highlight special structure, and show that the classic filtering and 
smoothing algorithms are equivalent to a particular algorithm for solving this prob- 
lem. Once this equivalence is established, we present extensions of Kalman smooth- 
ing to systems with nonlinear process and measurement models, systems with linear 
and nonlinear inequality constraints, systems with outliers in the measurements or 
sudden changes in the state, and systems where the sparsity of the state sequence 
must be accounted for. All extensions preserve the computational efficiency of the 
classic algorithms, and most of the extensions are illustrated with numerical exam- 
ples, which are part of an open source Kalman smoothing Matlab/Octave package. 



1 Introduction 



Kalman filtering and smoothing methods form a broad category of computational 
algorithms used for inference on noisy dynamical systems. Over the last fifty years, 
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these algorithms have become a gold standard in a range of applications, including 
space exploration, missile guidance systems, general tracking and navigation, and 
weather prediction. In 2009, Rudolf Kalman received the National Medal of Science 
from President Obama for the invention of the Kalman filter. Numerous books and 
papers have been written on these methods and their extensions, addressing modifi- 
cations for use in nonlinear systems, smoothing data over time intervals, improving 
algorithm robustness to bad measurements, and many other topics. 




Fig. 1 Dynamic systems amenable to Kalman smoothing methods. 

The classic Kalman filter |29| is almost always presented as a set of recursive 
equations, and the classic Rauch-Tung-Striebel (RTS) fixed-interval smoother [42 1 
is typically formulated as two coupled Kalman filters. An elegant derivation based 
on projections onto spaces spanned by random variables can be found in Q. In 
this chapter, we use the terms 'Kalman filter' and 'Kalman smoother' much more 
broadly, including any method of inference on any dynamical system fitting the 
graphical representation of Figure [TJ Specific mathematical extensions we con- 
sider include 

• Nonlinear process and measurement models. 

• Inequality state space constraints. 

• Different statistical models for process and measurement errors. 

• Sparsity constraints. 

We also show numerous applications of these extensions. 

The key to designing tractable inference methods for the above applications is 
an optimization viewpoint, which we develop in the classic Kalman smoothing case 
and then use to formulate and solve all of the above extensions. Though it has been 
known for many years that the Kalman filter provides the maximum a posteriori 
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estimate for linear systems subject to Gaussian noise, the optimization perspective 
underlying this idea has not been fully deployed across engineering applications. 
Notably, several groups (starting in 1977) have discovered and used variants of this 
perspective to implement extensions to Kalman filtering and smoothing, including 
singular filtering ( ll39l [33l l40l ). robust smoothing ( ll22l [7)), nonlinear smoothing 
with inequality state space constraints (|9j El), and sparse Kalman smoothing [lj. 

We focus exclusively on smoothing here, leaving online applications of these 
ideas to future work (see [41 1 for an example of using a smoother for an online ap- 
plication). We start by presenting the classic RTS smoothing algorithm in Section|2j 
and show that the well-known recursive equations are really an algorithm to solve a 
least squares system with special structure. Once this is clear, it becomes much eas- 
ier to discuss novel extensions, since as long as special structure is preserved, their 
computational cost is on par with the classic smoother (or, put another way, the clas- 
sic smoothing equations are viewed as a particular way to solve key subproblems in 
the extended approaches). 

In the subsequent sections, we build novel extensions, briefly review theory, dis- 
cuss the special structure, and present numerical examples for a variety of applica- 
tions. In Section[3] we formulate the problem for smoothing with nonlinear process 
and measurement models, and show how to solve it. In Section |4] we show how 
state space constraints can be incorporated, and the resulting problem solved using 
interior point techniques. In Section [5] we review two recent Kalman smoothing 
formulations that are highly robust to measurement errors. Finally, in Section [6j 
we review recent work in sparse Kalman smoothing, and show how sparsity can 
be incorporated into the other extensions. We end the chapter with discussion in 
Section|7] 



2 Optimization Formulation and RTS Smoother 
2.1 Probabilistic model 

The model corresponding to Figure[T]is specified as follows: 
xi = gl(*o)+Wi, 

Xk = &fc(x k -i)+w k k = 2,...,N, (1) 
z k = h k (x k )+\ k k=l,...,N, 

where w k , v k are mutually independent random variables with known positive 
definite covariance matrices Q k and R k , respectively. We have x k ,w k € K", and 
z k , v k € R m W , so measurement dimensions can vary between time points. The clas- 
sic case is obtained by making the following assumptions: 

1 . xq is known, and g k , h k are known linear functions, which we denote by 



gk(xk-i) = Gk*k- 1 fa (x k ) = H k x k 



(2) 



4 



Aleksandr Y. Aravkin and James V. Burke and Gianluigi Pillonetto 



where G k € W x " and H k € 
2- Wt, Vt are mutually independent Gaussian random variables. 

In later sections, we will show how to relax these classic assumptions, and what 
gains can be achieved once they are relaxed. In this section, we will formulate esti- 
mation of the entire state sequence, X\,X2, ■ ■ ■ ,Xn, as an optimization problem, and 
show how the RTS smoother solves it. 



2.2 Maximum a posteriori formulation 



To begin, we formulate the maximum a posteriori (MAP) problem under linear and 
Gaussian assumptions. Using Bayes' theorem, we have 

P({xk}\{z k })~P({z k }\{x k })P({x k }) 

k=\ 

N 1 _ 0) 

oc Y[ exp ( - - (z fc - H k x k ) J R k 1 (z k - H k x k ) 
k=\ l 

1 - - ^T^-l 



-{xk-GkXk-i) Q k (xk-GkXk-i) 
A better (equivalent) formulation to ^ is minimizing its negative log posterior: 

N l l 

min/({**}) : = Y. ~{zk - H k x k ) J R,; 1 (z k - H k x k ) + - (x k - G k x k _i) T Q^ 1 (x k - G k x k _x) 
W k= \ £ z 

(4) 

To simplify the problem, we now introduce data structures that capture the entire 
state sequence, measurement sequence, covariance matrices, and initial conditions. 
Given a sequence of column vectors {u k } and matrices {T k } we use the notation 



vec^Mj) 



u N 



, diag({7i}) 



Ti 

T 2 

■■■ 







■. 

T N 



We now make the following definitions: 
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R = diag({i? t }) x = vec({x k }) 
e = diag({g,}) M ; = vec({g ,0 I ...,0}) G = 
H = diag({/4}) z = vec({z[ ,z 2 , ■ ■ ■ ,Zjv}) 

(5) 

where go-^gi (xq) = G\xq. 

With definitions in |5]), problem Q can be written 

mmf(x) = ^\\Hx-z\\ 2 R - 1 + ^\\Gx-wf Q _ 1 , (6) 

where ||<z||ff = a T Ma. We knew the MAP was a least squares problem already, but 
now the structure is fully transparent. In fact, we can write down the closed form 
solution by taking the gradient of |6| and setting it equal to 0: 

= H T R- 1 {Hx -z)+G J Q- 1 {Gx - w) 
= (H T R- 1 H + G T Q- 1 G)x-H T R- 1 z-G T Q- 1 w . 

The smoothing estimate is therefore given by solving the linear system 

(H T R- 1 H + G T Q~ 1 G)x = H J R- l z + G J QT l w. (7) 



I 

G2 I 



■■ 

-G N I 



2.3 Special subproblem structure 

The linear system in |7]) has a very special structure: it is a symmetric positive def- 
inite block tridiagonal matrix. This can be immediately observed from the fact that 
both G and Q are positive definite. To be specific, it is given by 



C=(H r Rr l H + G T Qr x G) 



CiA\ 
A 2 C 2 Aj 

■. ■. ■. 

A N C N 



(8) 



with Ak G ' 



and Ck £ 



defined as follows: 



Ck = Ql 



■G k+1 Q k+1 Gk+\ +HjR k l H k 



(9) 



The special structure of the matrix C in ([H} can be exploited to solve the linear 
system equivalent to the Kalman smoother. While a structure-agnostic matrix in- 
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version scheme has complexity <9(n 3 Af 3 ), exploiting the block tri diagonal structure 
reduces this complexity to 0{n^N). 

A straightforward algorithm for solving any symmetric positive definite block 
tridiagonal linear system is given in ifTUl . We review it here, since it is essential to 
build the connection to the standard viewpoint of the RTS smoother. 



2.4 Block tridiagonal (BT) algorithm 



Suppose for k = 1, . . . ,N, € R" 



e R nxe , n e B" xt , and for k = 2, . . . ,N, 



inxi 



a k eR n 



We define the corresponding block tridiagonal system of equations 
/ci o| ••• 0\ 



a 2 c 2 






"N-l cjv-i aJi 
a N c N J 



( e x \ 




1 n \ 















(10) 



The following algorithm for (jTOJ is given in [10, Algorithm 4]. 
Algorithm 2.1 The inputs to this algorithm are {a^}, {c^}, and {r^}. The output is 



a sequence {e^} that solves equation (10\. 

1 . Set d\ = c\ and s\ =r\. 

2. For k = 2,...,N, set d k = c k - ajd~\a k , s k = r k - ajd~\s k -i. 

3. Set en = d^SN- 

4. Fork = N-l,...,l, set e k = d^ 1 (s k - a k+l e k+i ). 

Note that after the first two steps of Algorithm |2.1| we have arrived at a linear system 
equivalent to (jTOJ but upper triangular: 



(d\ a\ 
d 2 













1 ei \ 




( Si \ 




e 2 




s 2 




err-i 








\ e N / 







(11) 



d N -i a N 
d N/ 

The last two steps of the algorithm then simply back-solve for the e k . 



2.5 Equivalence of Algorithm ( |2.1[ ) to Kalman Filter and RTS 
Smoother 

Looking at the very first block, we now substitute in the Kalman data structures |9]l 



into step 2 of Algorithm 2. 1 
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d 2 = c 2 — a^d l l a 2 



= e^-(e 2 1 G 2 ) T 



Q^+HjR^H x +GJQ2 l G 2 



[Qz l G 2 ) +HjR^H 2 +GlQ i l G i 



(12) 

These relationships can be seen quickly from [5, Theorem 2.2.7]. The matrices P^ k , 
Pm-i are common to the Kalman filter framework: they represent covariances of 
the state at time k given the the measurements {z\ , • • • ,Zk}, and the covariance of the 
a priori state estimate at time k given measurements {z\ , . . . ,Zk-i}, respectively. 
From the above computation, we see that 

d 2 =p-'+G]Qi 1 G 3 . 



By induction, it is easy to see that in fact 

d k = P k\k + G ~k+lQkl\ G k+\ ■ 

We can play the same game with s^. Keeping in mind that r = H T R~ l z + G J Q~ l w, 
we have 

s 2 = r 2 -ajd^r 1 



■'HjR 2 1 z 2 + {Q 2 1 G 2 ) T 



2i +#i Ri'Hi+GiQi l G 2 



hJRi 1 zi+G]p^x q 



(13) 



These relationships also follow from ||5] Theorem 2.2.7]. The quantities a 2 | i an d «2|2 
are from the information filtering literature, and are less commonly known: they are 
preconditioned estimates 

a k\k=Pk\k x ki a k\k-l = Pk\k-l x k\k-\ ■ (14) 

Again, by induction we have precisely that = am. 



When you put all of this together, you see that step 3 of Algorithm 2. 1 is given 

by 



e N = d N l s N = (P N * N + o) P N \ N *k\k = x k \k , ( 15 ) 

so in fact is the Kalman filter estimate (and the RTS smoother estimate) for time 
point N. 
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Step 4 of Algorithm |2. 1 | then implements the backward Kalman filter, computing 
the smoothed estimates x k \ N by back-substitution. Therefore the RTS smoother is 
Algorithm [2T] applied to 0. 

The consequences are profound — instead of working with the kinds expressions 
seen in ( fT3] l and ( p"2| ), we can think at a high level, focusing on d6j, and simply using 
Algorithm |2T| (or variants) as a subroutine. As will become apparent, the key to all 
extensions is preserving the block tridiagonal structure in the subproblems, so that 
Algorithm 2. 1 can be used. 



2.6 Numerical Example: Tracking a Smooth Signal 



2r 




-1.5 



1 2 3 4 5 6 7 

Fig. 2 Tracking a smooth signal (sine wave) using a generic linear process model 116) and di- 
rect (noisy) measurements j!8} . Red solid line is true signal, blue dashed line is Kalman (RTS) 
smoother estimate. Measurements are displayed as circles. 



In this example, we focus on a very useful and simple model: the process model 
for a smooth signal. Smooth signals arise in a range of applications: physics-based 
models, biological data, and financial data all have some inherent smoothness. 

A surprisingly versatile technique for modeling any such process is to treat it as 
integrated Brownian motion. We illustrate on a scalar time series x. We introduce 
a new derivative state i, with process model ±k+\ = Xk + Wk , and then model the 
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signal x or interest as Xk+i = Xk+XkAt + . Thus we obtain an augmented (2D) 
state with process model 



(16) 



Xk+l 




I 




x k 


+ 




x k+l_ 




At I 




Xk_ 







Using a well-known connection to stochastic differential equations (see 
[TTI0 we use covariance matrix 



Q k 



At At 2 /2 
At 2 /2 At 3 /3 



(17) 



Model equations ( fT6) > and < fT7| > can be applied as a process model for any smooth 
process. For our numerical example, we take direct measurements of the sin func- 
tion, which is very smooth. Our measurement model therefore is 



Zk 



HkXk + Vk, flit =[0 1] 



(18) 



The resulting fit is shown in Figure 2.6 The measurements guide the estimate 
to the true smooth time series, giving very nice results. The figure was generated 
using the ckbs package |6], specifically using the example file af f ine_ok . m. 
Measurement errors were generated using Rk = .35 2 , and this value was given to the 
smoother. The a 2 in ( fTTj ) was taken to be 1 . The program and example are available 
for download from COIN-OR. 



3 Nonlinear Process and Measurement Models 

In the previous section, we have shown that when gk and hk in model ([TJ are linear, 
and Vk,Wk are Gaussian, then the smoother is equivalent to solving a least squares 
problem We have also shown that the filter estimates appear as intermediate 



results when one uses Algorithm 2. 1 to solve the problem. 

In this section, we turn to the case where gk and hk are nonlinear. We first formu- 
late the smoothing problem as a maximum a posteriori (MAP) problem, and show 
that it is a nonlinear least squares (NLLS) problem. To set up for later sections, we 
also introduce the broader class of convex composite problems. 

We then review the standard Gauss-Newton method in the broader context of 
convex composite models, and show that when applied to the NLLS problem, each 
iteration is equivalent to solving ([6}, and therefore to a full execution of the RTS 
smoother. We also show how to use a simple line search to guarantee convergence 
of the method to a local optimum of the MAP problem. 

This powerful approach, known for at least 20 years l2Tl[T2l l9l. is rarely used in 
practice; instead practitioners favor the EKF or the UKF [18 28 1, neither of which 
converge to a (local) MAP solution. MAP approaches work very well for a broad 
range of applications, and it is not clear why one would throw away an efficient 
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MAP solver in favor of another scheme. To our knowledge, the optimization (MAP) 
approach has never been included in a performance comparison of 'cutting edge' 
methods, such as l34ll . While such a comparison is not in the scope of this work, 
we lay the foundation by providing a straightforward exposition of the optimization 
approach and a reproducible numerical illustration (with publicly available code) 
for smoothing the Van Der Pol oscillator, a well known problem where the process 
model is a nonlinear ODE. 



3.1 Nonlinear Smoother Formulation and Structure 



In order to develop a notation analogous to we define functions g : W' N —> 

R n(N+l) an( j h . R nN ^ R Af wkh M = ^ ^ from components gk an( J h k as follows. 







h\(xi) 


X2~g2(xi) 


, h(x) = 


hi(x2) 










h N {x N )_ 



(1) 



by 



With this notation, the MAP problem, obtained exactly as in Section 2.2 is given 

(2) 



min/(x) = l -\\g(x)- W \\ 2 Q ^ +\\\h{x)-z\\ 2 R _ l 



where z and w are exactly as in Q, so that z is the entire vector of measurements, and 
w contains the initial estimate gi (xq) in the first n entries, and zeros in the remaining 
n(N — 1) entries. 

We have formulated the nonlinear smoothing problem as a nonlinear least- 
squares (NLLS) problem — compare |2]l with ([SJ. We take this opportunity to note 
that NLLS problems are a special example of a more general structure. Objective Q 
may be written as a composition of a convex function p with a smooth function F: 



f(x)=p(F(x)) 



where 



~\\y2\\ R -i , 



F(x) 



g(x)- 
h(x) 



(3) 



(4) 



As we show in the next sub-section, problems of general form ([3]l can be solved 
using the Gauss-Newton method, which is typically associated specifically with 
NLLS problems. Presenting the Gauss-Newton right away in the more general set- 
ting will make it easier to understand extensions in the following sections of the 
chapter. 
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3.2 Gauss-Newton Method for Convex Composite Models 

The Gauss-Newton method can be used to solve problems of the form ([3}, and it 
uses a very simple strategy: iteratively linearizing the smooth function F 031. 
More specifically, the Gauss-Newton method is an iterative method of the form 

x v+l =x v + y v d v , (5) 
where d v is the Gauss-Newton search direction, and y v is a scalar that guarantees 

f(x v+l )<f(x v ). (6) 
The direction d v is obtained by solving the subproblem 

d v =argmin/(d) := p (f(jc v ) + VF(x v )V) . (7) 

We then set 

Af(x v )=f(d v )-f(x v ). 
By |[T5l Lemma 2.3, Theorem 3.6], 

f'(x v ;d v )<Af(x v )<0, (8) 

with equality if and only if x v is a first-order stationary point for /. This implies that 
a suitable stopping criteria for the algorithm is the condition Af(x v ) ~ 0. Moreover, 
x v is not a first-order stationary point for /, then the direction d v is a direction of 
strict descent for / at x v . 

Once the direction d v is obtained with Af(x v ) < 0, a step-size y v is obtained by a 
standard backtracking line-search procedure: pick a values < A < 1 and < K < 1 
(e.g., A = 0.5 and K = 0.001) and evaluate f(x v + X s d v ), s = 0, 1,2, . . . , until 

f(x v +X*d v ) < f(x v ) + KX°Af(x v ) (9) 

is satisfied for some s, then set y v = X s and make the GN update Q. The fact 
that there is a finite value of s for which |9| is satisfied follows from inequality 
f'(x v ;d v ) < Af(x v ) < 0. The inequality Q is called the Armijo inequality. A gen- 
eral convergence theory for this algorithm as well as a wide range of others is found 
in [15]. For the NLLS case, the situation is simple, since p is a quadratic, and stan- 
dard convergence theory is given for example in [27 1. However, the more general 
theory is essential in the later sections. 
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3.3 Details for Kalman Smoothing 



To implement the Gauss-Newton method described above, one must compute the 
solution d v to the Gauss-Newton subproblem Q for pi. That is, one must compute 




However, the problem (jTOJ has exactly the same structure as |6]); a fact that we have 
emphasized by defining 



w v :=w-g(x v ) , z v =z-h(x v ). (12) 

Therefore, we can solve it efficiently by using Algorithm |2.1| 

The linearization step in ( fT0| ) should remind the reader of the EKF. Note, how- 
ever, that the Gauss-Newton method is iterative, and we iterate until convergence to 
a local minimum of Q. We also linearize along the entire state space sequence x v 
at once in (jTOJ, rather than re-linearizing as we make our way through the xY's. 



3.4 Numerical Example: Van Der Pol Oscillator 

The Van der Pol oscillator is a popular nonlinear process for comparing Kalman 
filters, see 11241 and 11301 Section 4.1]. The oscillator is governed by a nonlinear 
ODE model 

X 1 {t)=X 2 {t) and X 2 {t)=ll[l-X l {tf]X 2 {t)-X l {t) . (13) 

In contrast to the linear model ( fT6] l, which was a generic process for a smooth signal, 
we now take the Euler discretization of ( fT3} to be the specific process model for this 
situation. 

Given X \ ) =Xk-i the Euler approximation forX(f^_i +At) is 

i \ ( x\,k-i+X2,k-iAt \ 

gk(x k -i) = [ X2k _ l+Ml -xl^-x^At ) ■ (14) 
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5 5 10 15 20 25 30 

Time(s) 

Fig. 3 Tracking the Van Der Pol Osciallator using a nonlinear process model \14) and direct 
(noisy) measurements \16\ of X\ -component only. Black solid line is true signal, blue dashed line 
is nonlinear Kalman smoother estimate. Measurements are displayed as circles. 
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For the simulation, the 'ground truth' is obtained from a stochastic Euler approxima- 
tion of the Van der Pol oscillator. To be specific, with fi = 2, N = 80 and At = 30/N, 
the ground truth state vector xt at time tt = kAt is given by xq = (0, — 0.5) T and for 
k=l,...,N, 

Xk = gk{xk-\)+Wk , (15) 

where {wk} is a realization of independent Gaussian noise with variance 0.01 and gt 
is given in ( fB) . Our process model for state transitions is also (TT5J, with = 0.01 / 
for k > 1, and so is identical to the model used to simulate the ground truth {x^}. 
Thus, we have precise knowledge of the process that generated the ground truth 
{xk}. The initial state xq is imprecisely specified by setting g\ (xq) = (0.1, — 0.4) T ^ 
xo with corresponding variance Q\ =0.1 /. For k= l,...,N noisy measurements Zk 
direct measurements of the first component only were used 



3t=*U + Vjfc, (16) 



withv*~iV(0,l). 



The resulting fit is shown in Figure 3.4 Despite the noisy measurements of only 



X\ , we are able to get a good fit for both components. The figure was generated using 
the ckbs package |6|, see the file vanderpol_experiment_simple . m. The 
program and example are available for download from COIN-OR. 



4 State space constraints 

In almost every real-world problem, additional prior information is known about 
the state. In many cases, this information can be represented using state space con- 
straints. For example, in tracking physical bodies, we often know (roughly or ap- 
proximately) the topography of the terrain; this information can be encoded as a 
simple box constraint on the state. We may also know physical limitations (e.g. 
maximum acceleration or velocity) of objects being tracked, or hard bounds set by 
biological or financial systems. These and many other examples can be formulated 
using state space constraints. The ability to incorporate this information is particu- 
larly useful when measurements are inaccurate or far between. 

In this section, we first show how to add affine inequality constraints to the affine 
smoother formulation in Section [2] This requires a novel methodology: interior 
point (IP) methods, an important topic in optimization [49, 32, 37|. IP methods work 
directly with optimality conditions, so we derive these conditions for the smoothing 
problem. Rather than review theoretical results about IP methods, we give a gen- 
eral overview and show how they specialize to the linear constrained smoother. The 
constrained Kalman smoother was originally proposed in 1111 . but we improve on 
that work here, and present a simplified algorithm, which is also faster and more 
numerically stable. We illustrate the algorithm using a numerical example, building 
on the example in Section[2] 
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Once the linear smoother with linear inequality constraints is understood, we re- 
view the constrained nonlinear smoother (which can have nonlinear process, mea- 
surement, and constraint functions). Using [11 1 and references therein, we show 
that the constrained nonlinear smoother is iteratively solved using linear constrained 
smoothing subproblems, analogously to how the nonlinear smoother in Section|3]is 
iteratively solved using linear smoothing subproblems from Section [2] Because of 
this hierarchy, the improvements to the affine algorithm immediately carry over to 
the nonlinear case. We end with a nonlinear constrained numerical example. 



4.1 Linear Constrained Formulation 

We start with the linear smoothing problem d6j, and impose linear inequality con- 
straints on the state space x: 

B k x k < b k . (1) 

By choosing the matrix B k and b k appropriately, one can ensure x k lies in any poly- 
hedral set, since such a set is defined by a finite intersection of hyperplanes. Box 
constraints, one of the simplest and useful tools for modeling (l k < x k < u k ) can be 
imposed via 













I 





In order to formulate the problem for the entire state space sequence, we define 

B = dmg({B k }), b=vec({b k }), (2) 

and all of the constraints can be written simultaneously as Bx < b. The constrained 
optimization problem is now given by 

min/(x)= I||tf*- Z ||2_ 1 + i||G*-Hle-i 
subject to Bx + s = b, s > . 

Note that we have rewritten the inequality constraint as an equality constraint by 
introducing a new 'slack' variable s. 

We derive the Kamsh-Kuhn-Tucker (KKT) conditions using the Lagrangian for- 
mulation. The Lagrangian corresponding to Q is given by 

JSf(je,a,*) = -\\Hx-z\\ 2 R -i +-\\Gx-w\\ 2 Q ^ +u J {Bx + s-b) . (4) 

The KKT conditions are now obtained by differentiating ££ with respect to its argu- 
ments. Recall that the gradient of (|6]l is given by 

{H^R^H + G T Q- 1 G)x - H J R- l z - G T Q- 1 w . 
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As in ([8} set C = H 1 R~ X H + G 1 QT l G, and for convenience set 

c=H T R- l z + G J Q- l w (5) 

The KKT necessary and sufficient conditions for optimality are given by 

V*_Sf = Cx + c + B T u = 

V 9 «Sf = Bx + s-b = (6) 
U[S\ = V; ;ui,Si > . 

The last set of nonlinear equations is known as complementarity conditions. In 
primal-dual interior point methods, the key idea for solving ([3]l is to successively 
solve relaxations of the system (|6]l that converge to a triplet (x, u, s) which satisfy |6]). 



4.2 Interior Point Approach 



IP methods work directly to find solutions of d5). They do so by iteratively relaxing 
the complementarity conditions = to = jJ. as they drive the relaxation 
parameter jj. to 0. The relaxed KKT system is defined by 



s+Bx-b 

SUl-nl 

Cx + B T u-c 



(7) 



where S and U are diagonal matrices with s and u on the diagonal, and so the second 
equation in implements the relaxation = ji of |6|. Note that the relaxation 
requires that > for all i. Since the solution to ([3]) is found by driving the KKT 
system to 0, at every iteration IP methods attempt to drive F^ to by Newton's 
method for root finding. 

Newton's root finding method solves the linear system 



Fh >{s,u,x) 



As 
Au 
Ax 



-F^(s,u,x) 



(8) 



It is important to see the full details of solving ([8]) in order to see why it is so effective 
for constrained Kalman smoothing. The full system is given by 



(9) 



"/ 


B~ 




'As' 




s + Bx-b 


u s 







Au 






B T 


c 




Ax 




Cx + B T u — c 



Applying the row operations 
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row2 ^— row2 — C/rowi 
row3 <— row3 — B T S~ row2 



we obtain the equivalent system 



7 B 




'As' 




s+Bx-b 


S -UB 




Au 




-U{Bx-b)-nl 


C+B T S- l UB 




Ax 




Cx + B T u-c + B T S- 1 (U(Bx-b)+lll) 



In order to find the update for Ax, we have to solve the system 
(C + B T S~ 1 UB) Ax = Cx + B T u-c + B T S~ 1 (17 (Bx - 



Ml) (10) 



Note the structure of the matrix in the LHS of ( [T0| >. The matrix C is the same as 
in |6|, so it is positive definite symmetric block tridiagonal. The matrices S~ 1 and U 
are diagonal, and we always ensure they have only positive elements. The matrices 
B and B T are both block diagonal. Therefore, C + B T S~ l UB has the same structure 
as C, and we can solve ( fTO) using Algorithm 2.1 



Once we have Ax, the remaining two updates are obtained by back-solving: 

Au = US- 1 (B(x + Ax)-b) + - (11) 



and 



As = -s + b-B(x+Ax) 



(12) 



This approach improves the algorithm presented in ifTTI solely by changing the 
order of variables and equations in (|7j. This approach simplifies the derivation while 
also improving speed and numerical stability. 

It remains to explain how ji is taken to 0. There are several strategies, see ll49l[32l 
[37ll . For the Kalman smoothing application, we use one of the simplest: for two out 
of every three iterations fj. is aggressively taken to by the update ji = ju/ 10; while 
in the remaining iterations, jj, is unchanged. In practice, one seldom needs more than 
10 interior point iterations; therefore the constrained linear smoother performs at a 
constant multiple of work of the linear smoother. 



4.3 Two Linear Numerical Examples 



In this section, we present to simple examples, both with linear constraints. 



Constant Box Constraints 



In the first example, we impose box constraints in the example of Section 
Specifically, we take advantage of the fact the state is bounded: 



2.6 
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"0 2 4 6 8 10 12 



Fig. 4 Two examples of linear constraints. Black solid line is true signal, magenta dash-dot line 
is unconstrained Kalman smoother, and blue dashed line is the constrained Kalman smoother. 
Measurements are displayed as circles, and bounds are shown as green horizontal lines. In the 
left panel, note that performance of the bounded smoother is significantly better around time 4- 
10 — the unconstrained is fooled by the measurements at times 4 and 8. In the right panel, as the 
oscillations die down due to damping, the measurement variance remains unchanged, so it becomes 
much more difficult to track the signal without the bound constraints. 
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[-1] < M [i] 

We can encode this information in form ([TJ) with 



B k = 



1 

-1 



, h = 



19 
(13) 

(14) 



We contrast the performance of the constrained linear smoother with that of the 
linear smoother without constraints. To show the advantages of modeling with con- 
straints, we increase the measurement noise in both situations to a 2 = 1. The results 



are show in Figure 4.2 The constrained smoother avoids some of the problems en- 
countered by the unconstrained smoother. Of particular interest are the middle and 
end parts of the track, where the unconstrained smoother goes far afield because 
of bad measurement. The constrained smoother is able track portions of the track 
extremely well, having avoided the bad measurements with the aid of the bound 
constraints. The figure was generated using the ckbs package [6|, specifically us- 
ing the example file af f ine_ok_boxC . m. 



Variable Box Constraints 

In the second example, we impose time-varying constraints on the state. Specifically, 
we track an exponentially bounded signal with a linear trend: 

exp(— ar)sin(/Jr) + .It 



using the 'smooth signal' process model and direct measurements, as in Section 2.6 
The challenge here is that as the oscillations start to die down because of the expo- 
nential damping, the variance of the measurements remains the same. We can im- 
prove the performance by giving the smoother the exponential damping terms as 
constraints. 

We included the second example to emphasize that 'linearity' of constraints 
means 'with respect to the state' ; in fact, the constraints in the second example are 
simply box constraints which are time dependent. The second example is no more 
complicated than the first one for the constrained smoother. 



4.4 Nonlinear Constrained Smoother 

We now consider the nonlinear constrained smoother, where we allow process func- 
tions gk, measurement functions % to be nonlinear, and also allow nonlinear smooth 
constraints ^k{ x k) < bk- To be consistent with the notation we use throughout the pa- 
per, we define a new function 
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&(* 2 ) 



(15) 



so that all the constraints can be written simultaneously as % (x) < b. 

The problem we would like to solve now is a constrained reformulation of (|2| 



1 9 1 9 

mi" f{x)= -\\g{x)-w\\^+-\\h{x)-z\\ £ R -x 



(16) 



subject to 4 (x) — b < . 
At this point, we come back to the convex-composite representation described 



in Section 3.1 The constraint ^(x) — b < may be represented using an additional 
term in the objective function: 



8(Ux)-b\R- 
where 8 (x\C) is the convex indicator function: 



xeC 
°° x ^ C 



8{x\C) 

Therefore, the objective ( fTo*| ) can be represented as follows: 



(17) 



(18) 



/(*) 


= p{F{x)) 






1) 


= ^lbille- 1 












g(x) - w 


F(x) 




h(x) —z 






${x)-b 



|b 2 ||2_ 1 + 5(>»3|M-) 



(19) 



The approach to nonlinear smoothing in ifTTI is essentially the Gauss-Newton 



method described in Section 3.2 applied to ( [T9| >. In other words, at each iteration 
V, the function F is linearized, and the direction finding subproblem is obtained by 
solving 



min - II G v d — w- 
d 2 



W)f Q -^^\\H v d-z-h{x v )\\\^ 



subject to B v d <b-%(x v ) 
V 



(20) 
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where G v and H v are exactly as in ( fTO) , B v = V x ^(x v ) is a block diagonal matrix 
because of the structure of % ( fT5| , and we have written the indicator function in ( fT9] l 
as an explicit constraint to emphasize the structure of the subproblem. 

Note that ( p0| ) has exactly the same structure as the linear constrained smoothing 
problem (|3), and therefore can be solved using the interior point approach in the 
previous section. Because the convex-composite objective ( fT9| i is not finite valued 
(due to the indicator function of the feasible set), to prove convergence of the non- 
linear smoother, ifTTl uses results from 1141 . We refer the interested reader to ifTTl 
Lemma 8, Theorem 9] for theoretical convergence results, and to IfTTl Algorithm 6] 
for the full algorithm, including line search details. 

Because of the hierarchical dependence of the nonlinear constrained smoother 
on the linear constrained smoother, the simplified improved approach we presented 
in Section 4.2 pays off even more in the nonlinear case, where it is used repeatedly 
as a subroutine. 



4.5 Nonlinear Constrained Example 



1.5r 




-1.5' ' ' 1 1 1 1 1 

1 2 3 4 5 6 7 

Fig. 5 Smoother results for ship tracking example with linear process model, nonlinear measure- 
ment model, and nonlinear constraints (with respect to the state). Black solid line is true state, 
red triangles denote the constraint, magenta dash-dot line is the unconstrained estimate, and blue 
dashed line gives the constrained nonlinear smoothed estimate. 
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The example in this section is reproduced from |11]. Consider the problem of 
tracking a ship traveling close to shore where we are given distance measurements 
from two fixed stations to the ship as well as the location of the shoreline. Distance 
to fixed stations is a nonlinear function, so the measurement model here is nonlinear. 

In addition, the corresponding constraint functions {f k } are not affine because 
the shoreline is not a straight line. For the purpose of simulating the measurements 
{zk}, the ship velocity [X\ (t),X?,(t )] and the ship position [^(f) ,X}(f)] are given by 

X(t) = [1 , t , -cos(r) , 1.3-sin(f) ] T 

Both components of the ship's position are modeled using the smooth signal model 



in Section 2.6 Therefore we introduce two velocity components, and the process 
model is given by 



G k 



The initial state estimate is given by g\ (xq) = X(t\) and Q\ = IOO/4 where I4 is the 
four by four identity matrix. The measurement variance is constant for this example 
and is denoted by <7 2 . The distance measurements are made from two stationary 
locations on shore. One is located at (0,0) and the other is located at (271,0). The 
measurement model is given by 



" 1 0" 




At At 1 !! 





At 1 


, Qk = 


At 2 /2 At 3 /3 





10 





At At 2 /2 


At 







At 2 /2 At 3 /3 



h k (x k ) 








We know that the ship does not cross land, so X/[{t) > 1.25 — sin[X2(?)]. This 
information is encoded by the constraints 

%k{xk) = 1.25-sin(x 2 ,fc) ~M,k < ■ 
The initial point for the smoother is [0,0,0, l] T ,which is not feasible. The results are 



plotted in Figure 4.5 The constrained smoother performs significantly better than 
the unconstrained smoother in this example. The experiment was done using the 
ckbs program, specifically see sine_wave_example . m. 



5 Robust Kalman smoothing 

In many applications, the probalistic model for the dynamics and/or the observa- 
tions ([T| is not well described by a Gaussian distribution. This occurs in the model 
for the observations when they are contaminated by outliers, or more generally, 
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when the measurement noise Vf, is heavy tailed 04), and it occurs in the model for 
the dynamics when tracking systems with rapidly changing dynamics, or jumps in 
the state values OTI . A robust Kalman filter or smoother is one that can obtain an 
acceptable estimate of the state when Gaussian assumptions are violated, and which 
continues to perform well when they are not violated. 

We show how to accommodate non-Gaussian densities by starting with a simple 
case of non-Gaussian heavy tailed measurement noise Vk Q- However, this general 
approach can be extended to as well. Heavy tailed measurement noise occurs in 
applications related to glint noise |25|, turbulence, asset returns, and sensor failure 
or machine malfunction. It can also occur in the presence of secondary noise sources 
or other kinds of data anomalies. Although it is possible to estimate a minimum vari- 
ance estimate of the state using stochastic simulation methods such as Markov chain 
Monte-Carlo (MCMC) or particle filters ll24l[35l , these methods are very computa- 
tionally intensive, and convergence often relies on heuristic techniques and is highly 
variable. The approach taken here is very different. It is based on the optimization 
perspective presented in the previous sections. We develop a method for computing 
the MAP estimate of the state sequence under the assumption that the observation 
noise comes from the l\ -Laplace density often used in robust estimation, e.g., see 
11231 equation 2.3]. As we will see, the resulting optimization problem will again 
be one of convex composite type allowing us to apply a Gauss-Newton strategy for 
computing the MAP estimate. Again, the key to a successful computational strategy 
is the preservation of the underlying tri-diagonal structure. 



5.1 An l\-Laplace Smoother 



For u € W n we use the notation ||m| 1 for the £1 norm of u; i.e., ||m||i = |z*i| +. . . + 
\u m \. The multivariate l\ -Laplace distribution with mean ji and covariance R has the 
following density: 



p(v*) = det(2K) 1/2 exp 



-V2 



R 



-1/2 



(vt-M) 



(1) 



where R l l 2 denotes a Cholesky factor of the positive definite matrix R; i.e., R l l 2 {R l l 2 ) T - 
R. One can verify that this is a probability distribution with covariance R using the 
change of variables m = 7? _1 / 2 (v<. — /z). A comparison of the Gaussian and Laplace 
distributions is displayed in Figure [6] This comparison includes the densities, nega- 
tive log densities, and influence functions, for both distributions. 



5.1.1 Maximum a posteriori formulation 



1, where 
. Under 
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Fig. 6 Gaussian and Laplace Densities, Negative Log Densities, and Influence Functions (for 
scalar v^) 



P{{x k }\{zk})~P({zk}\{x k })P({x k }) 

N 



■UP({v k })P({w k }) 
fc=l 

N 



I 1 

]~[exp(-\/2 R- l l 2 {z k ~h k (x k )) ^ - -(x k ~ g k (x k -i)) T Q k l (x k -g k (x k -i)) 



k=\ 



(2) 

Dropping terms that do not depend on {x k }, minimizing this MAP objective with 
respect to {x k } is equivalent to minimizing 



/({**}) :=V2£ R k l,2 [z k -h k {x k y ' A 



k=\ 



l 2 



\ x k - 8k(x k - 1 )] T Q k 1 [x k ~ gk(xk- 1 )] , 



k=\ 



where, as in ([TJ, xq is known and go = g\ (xq). Setting 



R = diag({^}) 

Q = diag({gfc}) 
x = vec({x k }) 
w = vec({g ,0,...,0}) 
Z = vec({zi,z 2 ,---,Zw}) 



rto 



*1 




h\{x\) 


*2-g2(*l) 




h2{x 2 ) 


, h(x) = 


_XN — gN(xN-l)_ 




h N (x N )_ 



as in |5j and ([T}, the MAP estimation problem is equivalent to 



minimize 
x G R*" 



/to = ilkto-HlcH W2 r'/^^w-z) 



0) 



(4) 



5.1.2 The Convex Composite Structure 

The objective in Q can again be written as a the composition of a convex function 
p with a smooth function F: 
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f(x)=p(F(x)) 

where 

'yi\ i 



V2 



^Ibill^.+^H/f-^lli, F(x) 



g(x)-w 
h(x) — z 



25 
(5) 

(6) 



Consequently, the generalized Gauss-Newton methodology described in Section 3.2 
again applies. That is, given an approximate solution x v to Q, we compute a new 
approximate solution of the form 



where d v solves the subproblem 



minimize p (F (x v ) + F 1 (x v )d), 

dew 



(7) 



and 7 V is computed using the backtracking line-search procedure described in Sec- 
3.2 Following the pattern described in ( fTO] ), the subproblem where p and 



tion 



F are given in |6]), has the form 



d v = argmin f{d) = - \\G v d — w — g(x 

d 2 s. 



where 



Q-> 



V2||^ 1/2 (//V-z-/z(x v ))|| 1 , 

(8) 



G v 



H v =dmg{h\ l) (x l ),...,h ( ! ; ) (x N )}. (9) 



5.1.3 Solving the Subproblem by Interior Point Methods 

By (|8j, the basic subproblem that must be solved takes the form 



min — \\Gd — w\\ 
d 2" " 



f2\\Br l l 2 {Hd-z) 



i • 



(10) 



where, as in Q, 



26 



Aleksandr Y. Aravkin and James V. Burke and Gianluigi Pillonetto 



R = diag({Je t }) 
e = diag({2,}) 
H = <&ag({H k }) 



x = vec({x jt }) 

w = vec({wi,W2,...,wjv}) 

z = vec({zi,z 2 ,...,zw}) 



G = 



I 

-G 2 I 



■■ 

-Gjv I 



(11) 

Using standard optimization techniques, one can introduce a pair of auxiliary non- 
negative variables p + ,p~ £ R M (M — Y, k =i m{k)) so that this problem can be rewrit- 
ten as 



minimize ^d T Cd + c T d + y/2 (p + +p~ 
w.r.t. d 

subject to Bd + b = p + — p~ 



(12) 



where 



C = G T Q 1 G = 



CiAj 
A 2 C 2 A] 

•. '•• '•. 

A N C N 



-Qk l G k 



Ck - Q k 1 +Gj +l Q k _l l G k+ i 
c — G T w 
B = R- 1/2 H 

-1/2, 



b=-R--z 

The problem ( |T2] > is a convex quadratic program. If we define 



F li{p + ,P ,s + ,s ,d) 



p+-p- -b-Bd 
diag(/j _ )diag(^)l-^l 

s++s- -2V2 
diag(/7+)diag(i + )l - fil 
Cd + c + B T (s- -s+)/2 



(13) 



for jj, > 0, then the KKT conditions for ( fT2| ) can be written as 

Fo(j> + ,P~,s + ,s~,d) = . 

The set of solutions to F jl (p + ,p~~ ,s + ,s~ ,d) = for jj. > is called the central path. 
We solve the system for ji = by an interior point strategy which, as described 
earlier, is a Newton based predictor-corrector method for following the central path 
as ji 4, 0. At each iteration of the interior point method we need to solve a system of 
the form 



F ^(p + ,P ,s + ,s ,d)+F^(p + ,p ,s + ,s ,d) 



Ap+ 
Ap- 
As+ 
As- 
Ay 
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where the vectors p + ,p ,s + , and s are componentwise strictly positive. Using 
standard methods of Gaussian elimination (as in Section |4~2") i, we obtain the solution 

Ay = [C + B T T- 1 B]-\e + B T T- 1 f) 

As' = T~ l BAy—T~ l f 

As + = -As' +2V2-s + - s~ 

Ap~ = diagfs - ) -1 [rl - diag(p~)As~] - p~ 

Ap + = Ap~ +BAy + b + By- p + + p~ , 

where 

d = Tl/s + -Tl/s~ -b-By + p + 

e = B T (V2-s~)-Cy-c 

f = c/-diag(5 + )" 1 diag(p+)(2\/2-^) 

T = diag(i + ) _1 diag(p + ) +diag(i _ ) _1 diag(/7 _ ) . 

Since the matrices T and B are block diagonal, the matrix B T TB is also block di- 
agonal. Consequently, the key matrix C + B T T~ 1 B has exactly the same form as the 
block tri -diagonal matrix in (jTOJ with 

c k = Qk l +G[ +l Q- k l x G k+l +HjT k - l H k k=l,...,N, 
cik = -QT k X G k k = 2,...,N, 



where T k = diag(s^)~ dmg(p k ) +diag(i^) _ diag(p^). Algorithm 2.1 can be ap- 
plied to solve this system accurately and stably with (9(n 3 A^) floating point opera- 
tions which preserves the efficiency of the classical Kalman Filter algorithm. 

Further discussion on how to incorporate approximate solutions to the quadratic 
programming subproblems can be found in [7 Section V]. 



5.1.4 A Linear Example 

In the linear case, the functions g k and h k is ([TJ are affine so that they equal their 
linearizations. In this case, the problems Q and |7]l are equivalent and only one 
subproblem of the form (jTOJ, or equivalently ( p"2] ), needs to be solved. We illustrate 
the ^i-Laplace smoother described in Section [5. 1.1 by applying it to the example 



studied in Section 2.6 except now the noise term v k is modeled using the l\ -Laplace 
density. The numerical experiment described below is take from [7, Section VI]. 

The numerical experiment uses two full periods of X(t) generated with jV = 100 
and At = An/N; i.e., discrete time points equally spaced over the interval [0,4tt]. 
For k — l,...,N the measurements z k were simulated by z k — ^lifk) + v k ■ m order 
to test the robustness of the t\ model to measurement noise containing outlier data, 
we generate v k as a mixture of two normals with p denoting the fraction of outlier 
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Table 1 Median MSE and 95% confidence intervals for the different estimation methods 



p 





GKF 


IGS 


ILS 







.34 (.24, .47) 


.04(.02, .1) 


.04(.01, .1) 


.1 


1 


.41(.26, .60) 


.06(.02, .12) 


.04(.02, .10) 


.1 


4 


.59(32, 1.1) 


.09(.04, .29) 


.05(.02, .12) 


.1 


10 


1.0(.42, 2.3) 


.17(.05, .55) 


.05(.02, .13) 


.1 


100 


6.8(1.7, 17.9) 


1.3(.30, 5.0) 


.05(.02, .14) 



contamination; i.e., 

v k ~(l-p)N(0,0.25)+pN(0,*). (14) 

This was done for p <G {0, 0.1} and <p e {1,4, 10, 100}. The model for the mean 
of Zk given x\ is hi c (x/ c ) = (0, = *2,fc • Here x^.k denotes the second component 
of Xk- The model for the variance of Zk given x\ is Rk = 0.25. This simulates a 
lack of knowledge of the distribution for the outliers; i.e, /?N(O,0). Note that we 
are recovering estimates for the smooth function — sin(f ) and its derivative — cos(/) 
using noisy measurements (with outliers) of the function values. 




Fig. 7 Simulation: measurements (+), outliers (o) (absolute residuals more than three standard 
deviations), true function (thick line), l\ -Laplace estimate (thin line), Gaussian estimate (dashed 
line), Gaussian outlier removal estimate (dotted line) 
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We simulated 1000 realizations of the sequence {zk} keeping the ground truth 
fixed, and for each realization, and each estimation method, we computed the corre- 
sponding state sequence estimate {i^}. The Mean Square Error (MSE) correspond- 
ing to such an estimate is defined by 

1 N 

MSE= - ^[xi, t -i lit ] 2 + [jC2.*-%] 2 , (15) 
n k=i 

where — X{tj/). In Table [I] the Gaussian Kalman Filter is denoted by (GKF), 
the Iterated Gaussian Smoother (IGS), and the Iterated t\ -Laplace Smoother (ILS). 
For each of these estimation techniques, each value of p, and each value of 0, the 
corresponding table entry is the median MSE followed by the centralized 95% con- 
fidence interval for the MSE. For this problem, the model functions {gk{ x k-\)} an d 
{h^Xj/)} are linear so the iterated smoothers IGS and ILS only require one iteration 
to estimate the sequence 

Note the l\ -Laplace smoother performs nearly as well as the Gaussian smoother 
at the nominal conditions (p = 0). The l\ -Laplace smoother performs better and 
more consistently in cases with data contamination ( p > .1 and > 1 ). It is also 
apparent that the smoothers perform better than the filters. 

Outlier detection and removal followed by refitting is a simple approach to robust 
estimation and can be applied to the smoothing problem. An inherent weakness of 
this approach is that the outlier detection is done using an initial fit which assumes 
outliers are not present. This can lead to good data being classified as outliers and 
result in over fitting the remaining data. An example of this is illustrated in Figure|7] 
which plots the estimation results for a realization of {zjj where p = 0.1 and (j) = 
100. Outlier removal also makes critical review of the model more difficult. A robust 
smoothing method with a consistent model, such as the l\ -Laplace smoother, does 
not suffer from these difficulties. 



5.1.5 Stochastic Nonlinear Process Example 

We now illustrate the behavior of the t\ -Laplace smoother on the Van Der Pol Oscil- 
lator described in Section 3.4 The numerical experiment we describe is taken from 
Section VI]. The corresponding nonlinear differential equation is 

^(0=X 2 (0 and X 2 (r) = ju (t) 2 }X 2 (t)-X l (t) . 

Given X(tk-\) =Xk-i the Euler approximation forX^k-i + At) is 

, \ ( *u-i +x%,k-\At \ 

Sk^k-i)-y X2k _ i + { ^ [l _ x i k]x2k _ Xyk}At j ■ 

For this simulation, the 'ground truth' is obtained from a stochastic Euler ap- 
proximation of the Van der Pol oscillator. To be specific, with jj. = 2, = 164 
and At = 16/ N, the ground truth state vector x% at time t^ — kAt is given by 
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x = (0, -0.5) T and for k = 1, . . . ,N, 

xk=gk{xk-i) + w k , (16) 

where {w^} is a realization of independent Gaussian noise with variance 0.01. Our 
model for state transitions ([TJ uses Qk — 0.01 / for k > 1, and so is identical to the 
model used to simulate the ground truth {x^}. Thus, we have precise knowledge of 
the process that generated the ground truth {x^}. The initial state xq is imprecisely 
specified by setting gi(xo) = (0.1,— 0.4) T ^ xq with corresponding variance Q\ = 
0.1/. 



Table 2 Median MSE over 1000 runs and confidence intervals containing 95% of MSE results 



p 





IGS 


ILS 







0.07 (0.06, 0.08) 


0.07 (0.06, 0.09) 


.1 


10 


0.07 (0.06, 0.10) 


0.07 (0.06, 0.09) 


.2 


10 


0.08 (0.06, 0.11) 


0.08 (0.06, 0.11) 


.3 


10 


0.08 (0.06, 0.11) 


0.08 (0.06, 0.11) 


.1 


100 


0.10(0.07,0.14) 


0.07 (0.06, 0.10) 


.2 


100 


0.12(0.07,0.40) 


0.08 (0.06, 0.11) 


.3 


100 


0.13 (0.09, 0.64) 


0.08 (0.07, 0.10) 


.1 


1000 


0.17 (0.11, 1.50) 


0.08 (0.06, 0.11) 


.2 


1000 


0.21 (0.14, 2.03) 


0.08 (0.06, 0.11) 


.3 


1000 


0.25 (0.17,2.66) 


0.09 (0.07, 0.12) 



For k — l,...,N the measurements Zk were simulated by Zk = + V* . The 
measurement noise Vk was generated as follows: 

v*~(l-/>)N(O,l.O)+/>N(O,0). (17) 

This was done for p e {0,0.1,0.2,0.3} and <p G {10, 100, 1000}. The model for the 
mean of Zk given x% is hk(xk) = (l,0)xk = X\k- As in the previous simulation, we 
simulated a lack of knowledge of the distribution for the outliers; i.e, pN(Q,<j>). In 
|[TJ), the model for the variance of Zk given Xk is Rk = 1 .0. 

We simulated 1000 realizations of the ground truth state sequence {x^} and the 
corresponding measurement sequence {zk}- For each realization, we computed the 
corresponding state sequence estimate {%} using both the IGS and IKS procedures. 
The Mean Square Error (MSE) corresponding to such an estimate is defined by 
equation ( [T5|, w here x^ is given by equation ( [To*} . The results of the simulation ap- 
pear in Table|2] As the proportion and variance of the outliers increase, the Gaussian 
smoother degrades, but the l\ -Laplace smoother is not affected. 

Figure [8] provides a visual illustration of one realization {x^} and its corre- 
sponding estimates {x~k}. The left two panels demonstrate that, when no outliers 
are present, both the IGS and ILS generate accurate estimates. Note that we only 
observe the first component of the state and that the variance of the observation is 
relatively large (see top two panels). The right two panels show what can go wrong 
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Nominal Measurement Errors 



o 

& 
o 
o 



20% of measurement errors are N(0, 100) 

5rO OO OOOOOO 



stochastic realization 
■ ■ ■ Gaussian Smoother 
■ i ■ 1 1 L1 Smoother 
O Data 




Time(s) 



Time(s) 



Fig. 8 The left two panels show estimation of x\ , (top) and (bottom) with errors from the nomi- 
nal model. The stochastic realization is represented by a thick black line; the Gaussian smoother is 
the blue dashed line, and the l\ -smoother is the magenta dash-dotted line. Right two panels show 
the same stochastic realization but with measurement errors now from (p, <f>) = (.2, 100). Outliers 
appear on the top and bottom boundary in the top right panel. 



when outliers are present. The Van der Pol oscillator can have sharp peaks as a result 
of the nonlinearity in its process model, and outliers in the measurements can 'trick' 
the IGS into these modes when they are not really present. In contrast, the Iterated 
£\ -Laplace Smoother avoids this problem. 
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5.2 Further Extensions with Log-Concave Densities 



Let us step back for a moment and examine a theme common to all of the variations 
on the Kalman smoother that we have examined thus far and compare the objective 
functions in (|4}, ((51, (|2}, (13}, and ( [To*} . In all cases, the objective function takes the 
form 

N 

Y, v k(Hxk)-z k ;Rk)+Jk{xk-g{xk-i);Qk) , (18) 

k=l 

where the mappings V* and Jk are associated with log-concave densities of the form 

Pv,k{z) « exp (-V k {z -Rk))) and p w , k {x) ^ exp {-J k (x; Q k )) 

with p v i, and p wk having covariance matrices R k and Q k , respectively. The choice 
of the penalty functions and J k reflect the underlying model for distribution of 
the observations and the state, respectively. In many applications, the functions V* 
and Jk are a members of the class of extended piecewise linear-quadratic penalty 
functions. 



5.2.1 Extended Linear-Quadratic Penalties 

Definition 1. For a nonempty polyhedral set U C M. m and a symmetric positive- 
semidefinite matrix M G M mxm (possibly M = 0), define the function 9 UM : R m — > 

{MU°o} :=Sby 

0u,m(w) := sup{(w,w) - \{u,Mu) \ . (19) 

u€U I 1 J 

Given and injective matrix B e M. mxn and a vector b e W", define p : W — > K as 

Pu,M,b,B (y) ■ = sup ueU {{u,b + By)~\{u, Mu) } . (20) 

All functions of the type specified in ([19} are called piecewise linear- quadratic 
(PLQ) penalty functions, and those of the form ( |20} are called extended piecewise 
linear- quadratic (EPLQ) penalty functions. 

Remark 1. PLQ penalty functions are extensively studied by Rockafellar and Wets 
in |43l . In particular, they present a full duality theory for optimizations problems 
based on these functions. 

It is easily seen that the penalty functions arising from both the Gaussian and 
l\ -Laplace distributions come from this EPLQ class. But so do other important den- 
sities such as the Huber and Vapnik densities. 

Examples: The £2, h> Huber, and Vapnik penalties are representable in the notation 
of DefinitionQ] 
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- V(x) = -Kx - i A' a : x < -K 
-V(x) = \x 1 ; -K<x<K 

- Viz) = Kx - \KH K < x 



\ 

\ 








V(x) = 


—x — e; x < — e 


V(x) = 


0; -e < x < e 


V(x) = 


x — e; e < x 





Fig. 9 Huber (left) and Vapnik (right) Penalties 



1. L 2 : Take U = E, M = 1, b = 0, and B = 1. We obtain p(y) = sup ( - -u' 

The function inside the sup is maximized at u = y, whence p (y) = ^y 2 . 

2. £r. Take U = [-1, 1], M = 0, b = 0, and B = 1. We obtain p(y) = " sup (wy) . 

«G [-1,1] 

The function inside the sup is maximized by taking u — sign(y), whence p(y) = 

\y\- 

3. Huber:TakeC7 = [-A',A'],M=l,fc = 0,andB=l.Weobtainp(y)= sup (uy- 
Take the derivative with respect to u and consider the following cases: 

a. If y < -K, take u = -K to obtain -Ky - \K 2 . 

b. If —K <y < K, take u = y to obtain ^y 2 . 

c. If y > K, take u = A" to obtain a contribution of Ay - 



\K 2 . 



This is the Huber penalty with parameter K, shown in the left panel of Fig. 1. 

j ] , and b — [Z|] , for some e > 



4. Vapnik: take U = [0, 1] x [0, 1], M = 
0. We obtain p(y) = sup„ liM26[01] / 
representation by considering three cases: 




.0 






r l 
.-i 


y- 


e 




»i 


-y 


-e 




"2 



We can obtain an explicit 



a. If |y | < e, take u\ =U2 = 0. Then p (y) = 0. 

b. If y > e, take u\ — 1 and «2 = 0. Then p(y) = y — e. 

c. If y < — e, take u\ = and «2 = 1. Then p(y) = — y — e. 

This is the Vapnik penalty with parameter e, shown in the right panel of Fig. 
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5.2.2 PLQ Densities 

We caution that not every EPLQ function is the negative log of a density function. 
For an ELQP function p to be associated with a density, the function exp(— p(x)) 
must be integrable on M". The integrability of exp(— p(x)) can be established under 
a coercivity hypothesis. 

Definition 2. A function p : R" — » IU {+°°} = K is said to be coercive (or 0- 
coercive) if limiiji^oapfx) = +°°. 



Since the functions Pu,M,b,B defined in ( |20| > are not necessarily finite-valued, their 
calculus must be treated with care. An important tool in this regard is the essential 
dominion. The essential domain of p : W — >• M. is the set 

dom(p) := {x : p(x) < +00} . 

The affine hull of dom(p) is the smallest affine set containing dom(p), where a set 
is affine if it is the translate of a subspace. 

Theorem 1. ^ Theorem 6] (PLQ Integrability). Let p := pu,M,b,B be defined as 
in P0[ ). Suppose p(y) is coercive, and let n a ff denote the dimension o/aff(dom p). 
Then the function f(y) = exp(— p(y)) is integrable on aff(dom p) with the n a ff- 
dimensional Lebesgue measure. 



Theorem 2. Theorem 7] ( Coercivity of p ). The function puju,b,B defined in ( [20 
is coercive if and only if [_B T cone({/)]° = {0}. 



If p := Pu.M.bM is coercive, then, by Theorem [T| then the function f(y) = 
exp(— p(y)) is integrable on aff(dom p) with the n a ff-dimensional Lebesgue mea- 
sure. If we define 

pW = K lex P^W) yedomp (21) 



else, 



where 



a = ( / exp(-p(y))dy 

'yGdom p 



and the integral is with respect to the Lebesgue measure with dimension n a ff, then p 
is a probability density on dom(p). We call these PLQ densities. 



5.2.3 PLQ Densities and Kalman Smoothing 

We now show how to build up the penalty functions and Jf, in ( fT8| using PLQ 
densities. We will do this for the linear model (jTJ-([2]) for simplicity. The nonlin- 
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ear case can be handled as before by applying the Gauss-Newton strategy to the 
underlying convex composite function. 

Using the notion given in (|5j, the linear model (|TJ-([2j) can be written as 

w = Gx+W m . 
z =Hx + \. { ' 

A general Kalman smoothing problem can be specified by assuming that the 
noises w and v in the model ( p2) have PLQ densities with means 0, variances Q and 
R (3J. Then, for suitable {Ug,M%,b]:,B%} and {U%,Aq,b v k ,B v k }, we have 



p(w) « exp{-e V w iM w(b w +B W Q-V 2 w)) 
p(v) - oip(-%» ) M»(6 v +B v *- 1 / 2 v)) , 



(23) 



where 



JJ * M»'=diag({Mr}) B v = diag({fi£}) 

M ^ M^diag^})' ^«jr 

^ * =vec({fc t }) 

Then the MAP estimator for jc in the model $22\ is 



. f/H ,M«^ + B lv fi- 1/2 (Gx-w))l 
arg mtn < > . (24) 

~ 1 +9u^(b v +B v R- 1 / 2 (Hx-z))\ 



Note that since and are independent, problem ( |24[ > is decomposable into 
a sum of terms analogous to ([ 1 8p . This special structure follows from the block 
diagonal structure of H,Q,R,B V ,B W , the bidiagonal structure of G, and the product 
structure of sets U w and U v , and is key in proving the linear complexity of the 
solution method we propose. 



5.2.4 Solving the Kalman Smoother Problem with PLQ Densities 



Recall that, when the sets U w and U v are polyhedral, (24i is an Extended Linear 
Quadratic program (ELQP), described in 11431 Example 11.43]. We solve ( p4| i by 
working directly with its associated Karush-Kuhn-Tucker (KKT) system. 

Lemma 1. [4. Lemma 3.1 ] Suppose that the sets Uj? and Ul are polyhedral, that is, 
they can be given the representation 

U? = {u\{Al) T u < <}, Ul = {u\(Al) T u < a\] . 

Then the first-order necessary and sufficient conditions for optimality in ( |24| > are 
given by 
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= (A w ) T u w + s w - a w ; = (A V )V + s v - a v 
= (s w ) T q w ; = {s v ) T q v 

= b w + B W Q- l / 2 Gx- M w u w - A w q w , (25) 

= P -B v R- l l z Hx-M v u v -A v q v 

= G T Q- T / 2 (B w ) T u w -H T R- J / 2 (B v ) T u v 

< s w ,s v ,q w ,q v . 

where b w = b w -B W Q- X l 2 w andb v = b v -B v RT l l 2 z. 



We propose solving the KKT conditions (|25]l by an Interior Point (IP) method. IP 



methods work by applying a damped Newton iteration to a relaxed version of ( 25 i 
where the relaxation is to the complementarity conditions. Specifically, we replace 
the complementarity conditions by 

(s w ) T q w = -> Q W S W 1 - nl = 
(s v ) T q v =0 -> e*'5 v l-Ail = 0, 

where Q w ,S W ,Q V ,S V are diagonal matrices with diagonals q w ,s w ,q v ,s v respectively. 
The parameter fi is aggressively decreased to as the IP iterations proceed. Typi- 
cally, no more than 10 or 20 iterations of the relaxed system are required to obtain 
a solution of ( |25] l, and hence an optimal solution to ( |24| i. The following theorem 
shows that the computational effort required (per IP iteration) is linear in the num- 
ber of time steps whatever PLQ density enters the state space model. 

Theorem 3. Theorem 3.2] (PLQ Kalman Smoother Theorem) Suppose that all 
Wk and Vk in the Kalman smoothing model ([T])-(|2| come from PLQ densities that 
satisfy Null(M) n U°° = {0}. Then an IP method can be applied to solve ( |24| l with a 
per iteration computational complexity of 0(Nn^ +Nm). 



The proof, which can be found in [4], shows that IP methods for solving |24| pre- 
serve the key block tridiagonal structure of the standard smoother. General smooth- 
ing estimates can therefore be computed in 0(Nn^) time, as long as the number of 
IP iterations is fixed (as it usually is in practice, to 10 or 20). 

It is important to observe that the motivating examples all satisfy the conditions of 
Theorem |3] 

Corollary 1. /@ Corollary 3.3] The densities corresponding to L l ,L 2 , Huber, and 
Vapnik penalties all satisfy the hypotheses of Theorem^ 

Proof: We verify that Null(M) n Null(A T ) = for each of the four penalties. In the 
L 2 case, M has full rank. For the L 1 , Huber, and Vapnik penalties, the respective sets 
U are bounded, so U°° = {0}. 
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5.2.5 Numerical example: Vapnik penalty and functional recovery 




Fig. 10 Simulation: measurements (•) with outliers plotted on axis limits (4 and —2), true function 
(continuous line), smoothed estimate using either the quadratic loss (dashed line, left panel) or the 
Vapnik's e-insensitive loss (dashed line, right panel) 



In this section we present a numerical example to illustrate the use of the Vap- 

in the Kalman smoothing context, for a functional 



nik penalty (see Figure 5.2.1 
recovery application. 

We consider the following function 



/(f)=exp[sin(8f)] 



taken from lfl9l . Our aim is to reconstruct / starting from 2000 noisy samples col- 
lected uniformly over the unit interval. The measurement noise v* was generated 
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using a mixture of two normals with p = 0. 1 denoting the fraction from each nor- 
mal; i.e., 

v fc ~(l- p )N(0,0.25)+/>N(0,25), 



10 



where N refers to the Normal distribution. Data are displayed as dots in Fig 
Note that the purpose of the second component of the normal mixture is to simulate 
outliers in the output data and that all the measurements exceeding vertical axis 
limits are plotted on upper and lower axis limits (4 and -2) to improve readability. 

The initial condition /(0) = 1 is assumed to be known, while the difference 
of the unknown function from the initial condition (i.e. /(•) — 1) is modeled as a 
Gaussian process given by an integrated Wiener process. This model captures the 
Bayesian interpretation of cubic smoothing splines [48], and admits a 2-dimensional 
state space representation where the first component of x(t), which models /(•) — 1, 
corresponds to the integral of the second state component, modelled as Brownian 
motion. To be more specific, letting At = 1/2000, the sampled version of the state 
space model (see [26, 38] for details) is defined by 



G k 



1 
At 1 



k = 2,3,..., 2000 



H k =[0l], k = 1,2,. ..,2000 



with the autocovariance of given by 



At 

,2 



At 1 



At z At 3 
2 3 



1,2,..., 2000. 



where X 2 is an unknown scale factor to be estimated from the data. 
The performance of two different Kalman smoothers are compared. The first (clas- 
sical) estimator uses a quadratic loss function to describe the negative log of the 
measurement noise density and contains only X 2 as unknown parameter. The sec- 
ond estimator is a Vapnik smoother relying on the e-insensitive loss, and so de- 
pends on two unknown parameters X 2 and e. In both of the cases, the unknown 
parameters are estimated by means of a cross validation strategy where the 2000 
measurements are randomly split into a training and a validation set of 1300 and 
700 data points, respectively. The Vapnik smoother was implemented by exploiting 
the efficient computational strategy described in the previous section, see HI for 
specific implementation details. In this way, for each value of X 2 and e contained 
in a 10 x 20 grid on [0.01, 10000] x [0, 1], with X 2 logarithmically spaced, the func- 
tion estimate was rapidly obtained by the new smoother applied to the training set. 
Then, the relative average prediction error on the validation set was computed, see 
Fig. 11 The parameters leading to the best prediction were X 2 = 2.15 x 10 3 and 
£ = 0.45, which give a sparse solution defined by fewer than 400 support vectors. 
The value of X 2 for the classical Kalman smoother was then estimated following 
the same strategy described above. In contrast to the Vapnik penalty, the quadratic 
loss does not induce any sparsity, so that, in this case, the number of support vectors 
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equals the size of the training set. 

The left and right panels of Fig.[lO]display the function estimate obtained using the 
quadratic and the Vapnik losses, respectively. It is clear that the Gaussian estimate is 
heavily affected by the outliers. In contrast, as expected, the estimate coming from 
the Vapnik based smoother performs well over the entire time period, and is virtually 
unaffected by the presence of large outliers. 




Fig. 11 Estimation of the smoothing filter parameters using the Vapnik loss. Average prediction 
error on the validation data set as a function of the variance process A 2 and e. 



6 Sparse Kalman smoothing 

In recent years, sparsity promoting formulations and algorithms have made a tremen- 
dous impact in signal processing, reconstruction algorithms, statistics, and inverse 
problems (see e.g. |[T3l and the references therein). In some contexts, rigorous math- 
ematical theory is available that can guarantee recovery from under-sampled sparse 
signals GUI . In addition, for many inverse problems, sparsity promoting optimiza- 
tion provides a way to exploit prior knowledge of the signal class as a way to im- 
prove the solution to an ill-posed problem, but conditions for recoverability have not 
yet been derived [36|. 

In the context of dynamic models, several sparse Kalman filters have been re- 
cently proposed ifTTl [T6l l47l [D . In the applications considered, in addition to pro- 
cess and measurement models, the state space is also known to be sparse. The aim is 
to improve recovery by incorporating sparse optimization techniques. Reference JT] 
is very close to the work presented in this section, since they formulate a sparsity 
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promoting optimization problem over the whole measurement sequence and solve 
it with an optimization technique shown to preserve computational efficiency. 

In this section, we formulate the sparse Kalman smoothing problem as an op- 
timization problem over the entire state space sequence, and suggest two new ap- 
proaches for the solution of such problems. The first approach is based on the inte- 
rior point methodology, and is a natural extension of the mathematics presented in 
earlier sections. 

The second approach is geared towards problems where the dimension n (state 
at a single time point) is large. For this case, we propose a matrix free approach, 
using a different (constrained) Kalman smoothing formulation, together with the 
projected gradient method. In both methods, the structure of the Kalman smoothing 
problem is exploited to achieve computational efficiency. 

We present theoretical development for the two approaches, leaving applications 
and numerical results to future work. 



6.1 Penalized Formulation and Interior Point Approach 

We consider only the linear smoother d§). A straight forward way to impose sparsity 
on the state is to augment this formulation with a 1-norm penalty: 

mjn/(jc) := l -\\Hx-z\\ 2 R -x + \\\Gx- +A||Wx||i , (1) 

where W is a diagonal weighting matrix included for modeling convenience. For 
example, the elements of W can be set to to exclude certain parts of the state di- 
mension from the sparse penalty. A straightforward constrained reformulation of ([TJ 
is 

min -\\Hx-z\\ 2 R - 1 + -\\Gx-w\\ 2 Q _ ] +n T y 
s.t. — y<Wx<y. 

Note that this is different from the constrained problem ([3]), because we have intro- 
duced a new variable y, with constraints in x and y. Nonetheless, an interior point 
approach may still be used to solve the resulting problem. We rewrite the constraint 
in ([8) using non-negative slack variables s, r. 

Wx-y+s = 

(3) 

-Wx-y + r = , 

and form the Lagrangian for the corresponding system: 

L(s,r,q,p,y 7 x) = x T Cx + c T x + Xl T y + q T (Wx-y + s) + p T (~Wx- y + r) , (4) 

with C as in ([8]) and c as in Q., and where q and p are the dual variables correspond- 
ing to the inequality constraints Wx < y and — Wx < —y, respectively. The (relaxed) 
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KKT system is therefore given by 



Ffi(s,r,q,p,y,x) 



(i) 



/s-y+Wx \ 

r — y — Wx 

D(s)D(q)l-fll 

D{r)D(p)\-pL\ 

XI — q — p 
\Wq-Wp + Cx + cJ 



(5) 



The derivative matrix F m is given by 



F« = 



and it is row equivalent to the system 



7 -7 
7 -7 
DO) D(q) 
D(r) D(p) 

c 



I 











-I 


w 





7 








-I 


-w 


D{q) 





D{s) 














Dip) 





D(r) 














-7 


I 














W 


-W 





c 



00 
00 



(6) 



w 
-w 

-D(q)W 
D{p)W 
-WW 
W<P~ l (<P 2 -V 2 )W 



where 



= D(s)- 1 D{q)+D{r)- l D(p) 
V = D(s)- l D(q)-D(r)- l D(p), 



(7) 



and the matrix <J> 2 — f 2 is diagonal, with the 2'2'th entry given by 4g,r,. Therefore, 
the modified system preserves the structure of C; specifically it is symmetric, block 
tridiagonal, and positive definite. The Newton iterations required by the interior 
point method can therefore be carried out, with each iteration having complexity 
0(n 3 N). 



6.2 Constrained Formulation and Projected Gradient Approach 



Consider again the linear smoother ((51, but now impose a 1-norm constraint rather 
than a penalty: 
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1 , 1 , 

rmn/(x) :=-\\Hx-z\\ £ R -i+ -\\Gx-w\\^ 

s.t. || Wjc|| i < x . 

This problem, which equivalent to ([l) for certain values of A and T, is precisely 
the LASSO problem [45 1, and can be written 

min^x T Cx + c T x s.t. ||Wjc||i < T . (9) 

with C e R" NxnN as in ^ and c e W N as in (|5j. When n is large, the interior point 
method proposed in the previous section may not be feasible, since it requires exact 
solutions of the system 

(C + W®- 1 (<P 2 — f 2 ) W)x = r , 



and the block-tridiagonal algorithm 2.1 requires the inversion of n x n systems. 

The problem |9} can be solved without inverting such systems, using the spectral 
projected gradient method, see e.g. J46, Algorithm 1]. Specifically, the gradient Cx + 
c must be repeatedly computed, and then x v — (Cx v + c) is projected onto the set 
I|Wjc||i < T. (the word 'spectral' refers to the fact that the Barzilai-Borwein line 
search is used to get the step length). 

In the case of the Kalman smoother, the gradient Cx + c can be computed in 
0{n 2 N) time, because of the special structure of C. Thus for large systems, the 
projected gradient method that exploits the structure of C affords significant savings 
per iteration relative to the interior point approach, 0(n 2 N) vs. 0(n 3 N), and relative 
to a method agnostic to the structure of C, 0(n 2 N) vs. 0(n 2 N 2 ). The projection onto 
the feasible set ||Wx||i < 1 can be done in 0(nN\og(nN)) time. 



7 Conclusions 

In this chapter, we have presented an optimization approach to Kalman smoothing, 



together with a survey of applications and extensions. In Section 2.5 we showed 



that the recursive Kalman filtering and smoothing algorithm is equivalent to algo- 



rithm 2.1 an efficient method to solve block tridiagonal positive definite systems. 
In the following sections, we used this algorithm as a subroutine, allowing us to 
present new ideas on a high level, without needing to explicitly write down modi- 
fied Kalman filtering and smoothing equations. 

We have presented extensions to nonlinear process and measurement models in 
Section [3] described constrained Kalman smoothing (both the linear and nonlinear 
cases) in Section [4] and presented an entire class of robust Kalman smoothers (de- 
rived by considering log-linear-quadratic densities) in Section [5] For all of these 
applications, nonlinearity in the process, measurements, and constraints can be han- 
dled by a generalized Gauss-Newton method that exploits the convex composite 
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structure discussed in Sections |3 . 1 1 and |4~4| The GN subproblem can be solved ei- 
ther in closed form or via an interior point approach; in both cases algorithm 2. 1 was 



used. For all of these extensions, numerical illustrations have also been presented, 
and most are available for public release through the ckbs package [6 |. 

In the case of the robust smoothers, it is possible to extend the density modeling 
approach by considering densities outside the log-concave class [3 |, but we do not 
discuss this work here. 

We ended the survey of extensions by considering two novel approaches to 
Kalman smoothing of sparse systems, for applications where modeling the sparsity 
of the state space sequence improves recovery. The first method built on the readers' 
familiarity with the interior point approach as a tool for the constrained extension in 
Section|4] The second method is suitable for large systems, where exact solution of 
the linear systems is not possible. Numerical illustrations of the methods have been 
left to future work. 
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